Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent
Author
Abstract
For large scale learning problems, it is desirable to obtain the optimal model parameters by going through the data in only one pass. Polyak and Juditsky (1992) showed that, asymptotically, the test performance of the simple average of the parameters obtained by stochastic gradient descent (SGD) is as good as that of the parameters that minimize the empirical cost. However, to our knowledge, despite its optimal asymptotic convergence rate, averaged SGD (ASGD) has received little attention in recent research on large scale learning. One possible reason is that it may take a prohibitively large number of training samples for ASGD to reach its asymptotic region on most real problems. In this paper, we present a finite sample analysis for the method of Polyak and Juditsky (1992). Our analysis shows that, with an improperly chosen learning rate, it indeed usually takes a huge number of samples for ASGD to reach its asymptotic region. More importantly, based on our analysis, we propose a simple way to set the learning rate so that ASGD reaches its asymptotic region after a reasonable amount of data. We compare ASGD using the proposed learning rate with other well known algorithms for training large scale linear classifiers. The experiments clearly show the superiority of ASGD.
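To make the averaging step concrete, below is a minimal sketch of one-pass ASGD for binary logistic regression. The decaying schedule eta_t = eta0 / (1 + a*eta0*t)^c and its constants are illustrative assumptions, not necessarily the exact schedule proposed in the paper; what the abstract describes is that the returned model is the Polyak-Ruppert average of the SGD iterates.

```python
import numpy as np

def asgd_logistic(X, y, eta0=1.0, a=1.0, c=0.75):
    """One pass of averaged SGD on the binary logistic loss.

    Labels y are in {-1, +1}. The decaying rate
    eta_t = eta0 / (1 + a * eta0 * t) ** c is an illustrative choice;
    the returned model is the running (Polyak-Ruppert) average of the
    SGD iterates.
    """
    n, d = X.shape
    w = np.zeros(d)        # current SGD iterate
    w_bar = np.zeros(d)    # averaged iterate, used as the final model
    for t in range(n):
        x_t, y_t = X[t], y[t]
        eta = eta0 / (1.0 + a * eta0 * t) ** c
        margin = y_t * np.dot(w, x_t)
        grad = -y_t * x_t / (1.0 + np.exp(margin))  # gradient of log(1 + e^{-margin})
        w -= eta * grad
        w_bar += (w - w_bar) / (t + 1)              # incremental average of iterates
    return w_bar
```

For example, `asgd_logistic(X, y)` with X an (n, d) array and y in {-1, +1} returns the averaged weight vector after a single pass over the data.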
Similar References
Stochastic Optimization for Large-scale Optimal Transport
Optimal transport (OT) defines a powerful framework to compare probability distributions in a geometrically faithful way. However, the practical impact of OT is still limited because of its computational burden. We propose a new class of stochastic optimization algorithms to cope with large-scale OT problems. These methods can handle arbitrary distributions (either discrete or continuous) as lo...
Linear Learning with Sparse Data
Linear predictors are especially useful when the data is high-dimensional and sparse. One of the standard techniques used to train a linear predictor is the Averaged Stochastic Gradient Descent (ASGD) algorithm. We present an efficient implementation of ASGD that avoids dense vector operations. We also describe a translation invariant extension called Centered Averaged Stochastic Gradient Desce...
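For intuition, here is a minimal sketch, under assumptions of my own, of the common timestamp ("lazy update") trick for maintaining the averaged iterate on sparse data without dense vector operations; it is not necessarily the implementation described in that paper. The running sum of iterates is caught up coordinate by coordinate only when a feature actually appears in an example.

```python
import numpy as np

def asgd_sparse_logistic(examples, n_features, eta0=1.0, a=1.0, c=0.75):
    """Averaged SGD on sparse data without dense per-step averaging.

    `examples` yields (idx, val, label) triples: idx holds the indices of
    the non-zero features, val their values, label is in {-1, +1}. The sum
    of iterates is maintained lazily per coordinate (timestamp trick), so
    each step only touches the features present in the current example.
    """
    w = np.zeros(n_features)                  # current SGD iterate
    w_sum = np.zeros(n_features)              # lazily maintained sum of iterates
    last = np.zeros(n_features, dtype=int)    # step up to which w_sum is current
    t = 0
    for idx, val, label in examples:
        t += 1
        idx = np.asarray(idx)
        val = np.asarray(val, dtype=float)
        # catch up: w[idx] has been constant since step last[idx]
        w_sum[idx] += (t - 1 - last[idx]) * w[idx]
        eta = eta0 / (1.0 + a * eta0 * t) ** c
        margin = label * np.dot(w[idx], val)
        g = -label / (1.0 + np.exp(margin))   # scalar gradient factor
        w[idx] -= eta * g * val               # sparse SGD step
        w_sum[idx] += w[idx]                  # contribution of step t
        last[idx] = t
    w_sum += (t - last) * w                   # final catch-up for all coordinates
    return w_sum / max(t, 1)                  # averaged iterate
```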
Towards stability and optimality in stochastic gradient descent
Lemmas 1, 2, 3 and 4, and Corollary 1, were originally derived by Toulis and Airoldi (2014). These intermediate results (and Theorem 1) provide the necessary foundation to derive Lemma 5 (only in this supplement) and Theorem 2 on the asymptotic optimality of θ̄n, which is the key result of the main paper. We fully state these intermediate results here for convenience but we point the reader to t...
A Bayesian Perspective on Generalization and Stochastic Gradient Descent
We consider two questions at the heart of machine learning: how can we predict whether a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? Our work responds to Zhang et al. (2016), who showed deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. We show th...
True Asymptotic Natural Gradient Optimization
We introduce a simple algorithm, True Asymptotic Natural Gradient Optimization (TANGO), that converges to a true natural gradient descent in the limit of small learning rates, without explicit Fisher matrix estimation. For quadratic models the algorithm is also an instance of averaged stochastic gradient, where the parameter is a moving average of a “fast”, constant-rate gradient descent. TANGO...
Journal: CoRR
Volume: abs/1107.2490
Pages: -
Publication year: 2011